Introduction to Data Science with R

Session 2: Data exploration basics

Ina Bornkessel-Schlesewsky

November 8, 2023

A bit of housekeeping

  • Everyone on the waiting list should have been accepted into the course
  • “Aktive Teilnahme” (active participation): complete the weekly exercises and submit them via Ilias by 12 pm of the following Wednesday (e.g. submit the exercises for today’s session by 15/11, 12 pm)
  • “Modulabschlussprüfung” (final module examination): complete the weekly exercises and submit a report at the end of the course

Data exploration and visualisation: first steps

Reproducibility checklist


What does it mean for a data analysis to be “reproducible”?

  • Are the tables, figures (and any other results reported) reproducible from the code and data?
  • Does the code actually do what you think it does?
  • In addition to what was done, is it clear why it was done?

Reproducibility toolkit

  • Scriptability → R
  • Literate programming (code, narrative, output in one place) → Quarto
  • Version control → Git / GitHub (more on this later)

Adapted from https://github.com/rstudio-education/datascience-box/blob/master/course-materials/slides/u1-d02-toolkit-r/u1-d02-toolkit-r.Rmd

A bit more on R and RStudio

Note: this is the default layout - all the panes can be moved around to best suit your workflow. Note also the possible change of appearance (e.g. dark themes).

R console

  • this is where the “magic happens”, i.e. where the calculations take place
  • think of a calculator on steroids 😂
  • we can interact with the console directly (see example)
  • mostly, however, we will be working with scripted input
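
As a minimal illustration of the calculator analogy, the lines below can be typed at the console prompt one at a time. The numbers and the variable name `masses` are made up for this sketch.

```r
# Basic arithmetic, just like on a calculator
2 + 3      # 5

# Built-in functions go beyond what a pocket calculator offers
sqrt(16)   # 4

# Values can be stored under a name and reused later
masses <- c(3750, 3800, 3250)  # hypothetical body masses in grams
mean(masses)                   # 3600
```

Anything typed directly into the console is lost once the session ends, which is one reason for moving to scripts.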

Cjp24 CC BY-SA 4.0, via Wikimedia Commons

Scripted R code


  • An R “script” is simply a text file with a collection of R commands
  • Think of it as a set of instructions that you can feed into your “calculator”
  • This is an important step towards reproducibility, as it means that you have a record of your analysis and you can recreate it at any time

Quarto


  • Mixes text (formatted using Markdown) and R code
  • Allows for documentation of analysis steps (the “why”): the next step towards reproducibility
  • Easily generate reproducible reports in different formats (.html, .pdf, .docx)
  • You can even use Quarto to create slides for presentations (these slides were created in Quarto), interactive tutorials and web sites.

Quarto

Quarto rendered document

Your turn: let’s explore the penguins data with the help of a Quarto document!


  • For this, we will be using a technique called “live coding”.
  • I will show you how to construct R code, talking you through the rationale for various coding choices.
  • By copying my code in real time, you will get a feel for how the process works.
  • Always type the code; don’t copy and paste it, even if certain aspects keep repeating. This will ensure that you start to build up the “muscle memory” required to conduct data exploration and visualisation yourself in future.


  • For this session, we will be working with the file basic_data_exploration.qmd, which you can find under exercises > week2_data_exploration

Key learning goals

By the end of this session, you should

  • understand the basics of how to work with a Quarto document for data exploration
  • know how to inspect a data frame
  • be able to undertake simple data manipulation steps (filtering, sorting, summarising)
  • be able to create basic graphs of different types using the ggplot() function
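
As a preview of the second goal, a data frame can be inspected with a handful of functions. This is a sketch that assumes the penguins data come from the palmerpenguins package and that dplyr is installed; the slides themselves do not name the packages.

```r
library(dplyr)           # provides glimpse()
library(palmerpenguins)  # assumed source of the penguins data

# Number of rows and columns
dim(penguins)

# Column names
names(penguins)

# First six rows
head(penguins)

# Compact overview: one line per column, with its type and first values
glimpse(penguins)

# Per-column summary statistics, including counts of missing values
summary(penguins)
```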

Resources

For your reference

Using basic data exploration functions

  • Basic data exploration functions are typically named with verbs, to remind you that they “do stuff” with a dataset (e.g. filter, arrange, summarise)
  • start with the data that you want to explore / manipulate (e.g. penguins)
  • use the “pipe” operator |> to send the data to a function (N.B. keyboard shortcut for the pipe: CTRL/CMD + SHIFT + M)
  • specify the function and any additional parameters in parentheses (these differ depending on the function)
  • this basic pattern works for all the functions we’ve looked at and more!
penguins |> 
  filter(species == "Adelie")


penguins |> 
  arrange(body_mass_g)


penguins |> 
  summarise(m_mass = mean(body_mass_g,
                          na.rm=TRUE))
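
It may help to see what the pipe does behind the scenes: it passes the object on its left as the first argument of the function on its right. The sketch below assumes that dplyr and palmerpenguins are the packages in use (the slides do not name them); `piped` and `nested` are illustrative names.

```r
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins data

# With the pipe: data first, then the verb
piped <- penguins |>
  filter(species == "Adelie")

# Without the pipe: the data frame is the first argument
nested <- filter(penguins, species == "Adelie")

# Both calls return the same rows
identical(piped, nested)
```

The piped form tends to be easier to read once several steps are involved, because the steps appear in the order in which they are carried out.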

Combining basic functions

  • You can use pipes to chain together different functions
  • This is a very powerful basis for data exploration and visualisation!
penguins |> 
  filter(species == "Adelie") |> 
  arrange(body_mass_g)


penguins |> 
  filter(island == "Dream") |> 
  summarise(m_mass = mean(body_mass_g,
                          na.rm=TRUE))
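
One way to build up a chain is step by step, checking each intermediate result before adding the next function. The following sketch assumes dplyr and palmerpenguins; the object names `dream`, `dream_summary` and `chained` are hypothetical.

```r
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins data

# Step 1 on its own: keep only penguins from Dream island
dream <- penguins |>
  filter(island == "Dream")

# Step 2 applied to the intermediate result
dream_summary <- dream |>
  summarise(m_mass = mean(body_mass_g, na.rm = TRUE))

# The chained version performs both steps in one go
chained <- penguins |>
  filter(island == "Dream") |>
  summarise(m_mass = mean(body_mass_g, na.rm = TRUE))

# Same result either way
identical(dream_summary, chained)
```

Chaining avoids cluttering the workspace with intermediate objects, while the stepwise version can be handy for tracking down where something goes wrong.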

(Very) basic plotting

Anatomy of a ggplot
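
A minimal sketch of the three building blocks that most ggplot calls share: the data, an aesthetic mapping, and a geometry layer. It assumes the ggplot2 and palmerpenguins packages; the choice of variables is illustrative only.

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins data

p <- ggplot(data = penguins,                      # 1. the data
            mapping = aes(x = flipper_length_mm,  # 2. mapping of variables
                          y = body_mass_g)) +     #    to visual properties
  geom_point()                                    # 3. geometry: draw points

p  # printing the object draws the plot
```

Note that layers are added with `+` rather than the pipe. Drawing the plot may produce a warning about rows removed because of missing values; this is expected for these data.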

Useful RStudio keyboard shortcuts


  • insert code chunk: CMD (Mac) / CTRL (Windows) + Option / ALT + i
  • insert pipe: CMD / CTRL + SHIFT + M
  • run block of code: CMD / CTRL + Return / Enter
    • you can place the cursor anywhere in a connected block of code for this
    • if you want to run only a particular selection of code, say part of a block, you can select it using the cursor and then use this keyboard shortcut

Exercises


Complete the exercises in the week2_exercises.qmd document (in the exercises > week2_data_exploration directory) and upload the rendered .html output to Ilias.

Note: you will need to download the .html file from Posit Cloud to your own computer and then upload it to Ilias.